Chinese Abbreviation Identification Using Abbreviation-Template
Abstract
Chinese abbreviations are frequently used without being defined, which creates considerable difficulty for NLP. In this study, the definition-independent abbreviation identification problem is proposed and resolved as a classification task in which abbreviation candidates are classified as either ‘abbreviation’ or ‘non-abbreviation’ according to their posterior probability. To identify new abbreviations from existing ones, we add generalization capability to the abbreviation lexicon by replacing words with word classes, thereby creating abbreviation-templates. An SVM classifier then uses abbreviation-template features together with context information. Evaluation on a raw Chinese corpus yields encouraging performance. Our experiments further demonstrate improvements from integrating extended word clustering (designed to enable joint learning of word classes), morphological analysis, substring analysis, and person name identification. To our knowledge, this is the first definition-independent machine learning approach to Chinese abbreviation identification.
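The template idea in the abstract can be illustrated with a toy sketch. The abstract does not specify how templates are represented, so everything here is an assumption for illustration: the `WORD_CLASS` map, the character-level units, and the function names `to_template`, `build_templates`, and `template_feature` are hypothetical, and the hand-written class map stands in for the learned word clustering. The resulting binary feature is the kind of signal that could be passed to a classifier alongside context features; the SVM itself is not shown.

```python
# Hypothetical word-class map; a real system would learn classes by clustering.
WORD_CLASS = {
    "北": "PLACE",   # as in 北京 (Beijing)
    "南": "PLACE",   # as in 南京 (Nanjing)
    "浙": "PLACE",   # as in 浙江 (Zhejiang)
    "大": "SCHOOL",  # as in 大学 (university)
}


def to_template(abbreviation, word_class=WORD_CLASS):
    """Replace each unit with its class; units without a class stay literal."""
    return tuple(word_class.get(unit, unit) for unit in abbreviation)


def build_templates(lexicon, word_class=WORD_CLASS):
    """Generalize a lexicon of known abbreviations into a set of templates."""
    return {to_template(abbr, word_class) for abbr in lexicon}


def template_feature(candidate, templates, word_class=WORD_CLASS):
    """Return 1 if the candidate's template matches a known template, else 0."""
    return int(to_template(candidate, word_class) in templates)


# Known abbreviations: 北大 (Peking University), 南大 (Nanjing University).
lexicon = ["北大", "南大"]
templates = build_templates(lexicon)

# 浙大 is not in the lexicon, but it shares the (PLACE, SCHOOL) template.
print(template_feature("浙大", templates))
# 浙江 has no matching template because 江 has no class in the toy map.
print(template_feature("浙江", templates))
```

Matching on templates rather than on literal lexicon entries is what lets an unseen candidate such as 浙大 inherit evidence from 北大 and 南大.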
Similar resources
Chinese Abbreviation Identification Using Abbreviation-Template Features and Context Information
Chinese abbreviations are frequently used without being defined, which has brought much difficulty into NLP. In this study, the definition-independent abbreviation identification problem is proposed and resolved as a classification task in which abbreviation candidates are classified as either ‘abbreviation’ or ‘non-abbreviation’ according to the posterior probability. To meet our aim of identif...
A Chinese Dataset with Negative Full Forms for General Abbreviation Prediction
Abbreviation is a common phenomenon across languages, especially in Chinese. In most cases, if an expression can be abbreviated, its abbreviation is used more often than its fully expanded form, since people tend to convey information in the most concise way. For various language processing tasks, abbreviation is an obstacle to improving performance, as the textual form of an abbreviation do...
Automatic Chinese Abbreviation Generation Using Conditional Random Field
Dong Yang, Yi-cheng Pan, and Sadaoki Furui, Department of Computer Science, Tokyo Institute of Technology, Tokyo 152-8552, Japan ({raymond,thomas,furui}@furui.cs.titech.ac.jp). Boulder, Colorado, June 2009. ©2009 Association for Computational Linguistics. Abstract: This paper presents a new method for automatically generating abb...
Cluster based Chinese abbreviation modeling
Abbreviations are widely observed in spoken Chinese. Automatic generation of Chinese abbreviations helps to improve Chinese natural language understanding systems and Chinese search engines. Abbreviation generation is treated as a character-based tagging problem. Due to limited training data, Chinese abbreviation generation suffers from data sparseness. Two types of strat...
Vocabulary expansion through automatic abbreviation generation for Chinese voice search
Long named entities are often abbreviated in spoken Chinese, and this usually leads to out-of-vocabulary (OOV) problems in speech recognition applications. The generation of Chinese abbreviations is much more complex than that of English abbreviations, most of which are acronyms and truncations. In this paper, we propose a new method for automatically generating abbreviations for Chinese named en...